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Background: Routinely collected data from tuberculosis surveillance system can be used to investigate and monitor 
the irregularities and abrupt changes of the disease incidence. We aimed at using a Hidden Markov Model in order to 
detect the abnormal states of pulmonary tuberculosis in Iran. 

Methods: Data for this study were the weekly number of newly diagnosed cases with sputum smear-positive pulmo- 
nary tuberculosis reported between April 2005 and March 201 1 throughout Iran. In order to detect the unusual states 
of the disease, two Hidden Markov Models were applied to the data with and without seasonal trends as baselines. 
Consequentiy, the best model was selected and compared with the results of Serfling epidemic threshold which is typi- 
cally used in the surveillance of infectious diseases. 

Results: Both adjusted R-squared and Bayesian Information Criterion (BIC) reflected better goodness-of-fit for the 
model with seasonal trends (0.72 and -1336.66, respectively) than the model without seasonalit)' (0.56 and -1386.75). 
Moreover, according to the Serfling epidemic threshold, higher values of sensitivit)' and specificit)' suggest a higher 
validity for the seasonal model (0.87 and 0.94, respectively) than model without seasonality (0.73 and 0.68, respective- 



Conclusion: A two-state Hidden Markov Model along with a seasonal trend as a function of the model parameters 
provides an effective warning system for the surveillance of tuberculosis. 
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Abstract 



ly)- 



The necessity of tuberculosis (TB) control as one 
of the top ten mortality causes is completely ap- 
parent (1-3). Despite the dramatic advancements 
in case detection and effective treatment methods 
against TB over the past six decades (4, 5), it still 
remains as a great area of concern resulting from 
the emergence of both multidrug-resistant (MDR) 
(6-8) and HIV-associated TB (9, 10) in many areas 
of the world (11). It is estimated about one-third 
of the people carry the TB baciUi, with almost 
nine million new infections and two miUion 
deaths every year due to TB (1). Moreover, 2.5% 



of the global burden of disease is attributed to TB 
(12). In Iran, although implementing National 
Tuberculosis Program (NTP) (13) caused TB cur- 
rently to be weU-controUed and the associated 
mortality exhibits a downward trend during past 
two decades (14), it continues to be a serious me- 
nace to public health (15, 16). Increasing the 



tries with high burden of the disease like Afgha- 
nistan, Pakistan and Iraq create particular chal- 
lenges for the disease control (19, 20). In 2010, a 
total of 11,092 old and new cases of TB were re- 
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ported in Iran and of these cases 246 patients 
(around 2.2%) were HIV positive (21). 
TB surveUlance and preventing further spread of 
the disease requires full understanding of the bio- 
logical factors affecting TB, and also finding ma- 
thematical patterns explaining the mechanism of 
TB transmission through the community (22). Al- 
though TB is not recognized as an infection with 
rapid dynamics (the chance of getting infection 
per contact is low), the risk of transmission is 
higher from patients with sputum smear— positive 
(SS+) (23). Therefore, regarding this assumption, 
it would be acceptable the fact that the number of 
persons getting infection over the next time pe- 
riod depends on the number of infectious cases at 
the current time period. Moreover, some studies 
have Ulustrated variable periods of peak seasonal- 
ity of TB time series with various patterns in dif- 
ferent countries. In particular, some of them have 
reported a higher incidence of the disease in the 
late winter to early spring in the sense that the in- 
door activities in the cold weather is much more 
common than in a warm climate (24, 25). 
The early detection of disease outbreaks has been 
one of the major concerns of all surveillance sys- 
tems. Since two decades ago, traditional surveil- 
lance techniques were replaced by biosurveUlance 
system with the purpose of reducing time delay to 
detect and report outbreaks (26). BiosurveUlance 
provide early warning system of epidemics by 
monitoring the data typically consist of time series 
counts of incident cases of disease, gathered 
monthly, weekly, or more frequently (27). There 
has been already a surge of interest and research in 
using statistical methods for the early detection of 
outbreaks based on the routinely surveillance data. 
Regression techniques, time series analysis, statis- 
tical process control and Bayesian methods are 
some examples of the statistical method which 
have been used for monitoring the epidemiologic 
surveillance of infectious diseases (28). 
Whereas infectious diseases mostly lie into one of 
the two non-epidemic and epidemic phases (29), it 
seems using the concept of finite mixture model is 
preferred to fit models based on a unique distribu- 
tion. The basic idea of using Hidden Markov 
Model (HMM) for monitoring the epidemiologic 



surveillance of infectious diseases was proposed 
by Le Strat and Garret in 1999 (30). They applied 
the model to the time series of flu-Uke disease in- 
cidence rates and poliomyelitis counts and demon- 
strated the ability of HMM in modeling the rou- 
tinely diseases surveillance data. Nevertheless, it 
has been rarely applied in public health systems 
for the same purpose (31). Five years later, Toni 
M. Rath et al. indicated some problems and short- 
comings in Strat's approach and presented some 
modification to their models (32) . 
In this study, HMMs which seems to be an ap- 
propriate tool in this issue were used to monitor 
the anomaly states of the weekly numbers of new- 
ly diagnosed cases with SS+. 

Material and Methods 

Study area and data source 

The data obtained from the national TB surveil- 
lance program of Iran including the entire SS+ 
patients registered at all TB register offices 
throughout Iran. The diagnostic criterion of a new 
SS+ pulmonary TB case was based on the exis- 
tence of one of the following conditions: (a) At 
least two initial sputum smear tests positive for 
Acid-Fast-BaciUi (AFB), or (b) One sputum smear 
test positive for AFB plus radiographic diagnosis 
of active pulmonary TB, or (c) One sputum smear 
positive for AFB plus sputum culture-positive for 
M. TB. Currently, the information of all new diag- 
nosed cases with any type of TB is recorded and 
documented at TB register offices located in all 
counties of Iran and send as a report quarterly to 
the "Administration of Tuberculosis and Leprosy 
Control" of Iran Ministry of Health and Medical 
Education (13). 

The time series of the weekly number of newly 
diagnosed SS+ patients were extracted from April 
2005 to March 2011 using the TB Register soft- 
ware (version 7). The first version of the software 
was released in 2005 in order to improve the qual- 
ity of data and statistical reports resulting from the 
TB surveillance system of Iran. Fig. 1 exhibits the 
time series of the weekly number of newly diag- 
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nosed SS+ patients during the six-year period 
2005-2011. 




2005 2006 2007 2008 2009 2010 2011 



Fig. 1: Weekly number of patients newly diagnosed 
with SS+ pulmonary TB in Iran over 2005-201 1 

Cyclic regression 

In the mid-1960s, Surfling developed a cyclic re- 
gression model to monitor and detect the anomaly 
activities of Pneumonia and Influenza based on 
the weekly excess mortality data attributed to the 
disease. The model characterized the historical 
sequence of the disease time series by combina- 
tion of a linear term with a trigonometric function 
describing the seasonal trend (33). This idea origi- 
nated from the Fourier series and can be formu- 
lated with two terms as a multiple linear regres- 
sion: 

y, =13, + P,t + l3,sin( — ) + l3,cos( — ) + e, 

r r 

Where J; denotes observed weekly Pneumonia and 
Influenza deaths at week / for flve years period; ; 
j— 0,1,2,3 are regression coefficients, as and /?, 
describe the linear part and /^^ and /^^ belong to the 
seasonal part; r is the time duration of fluctua- 
tions; and E, is an independent normally distri- 
buted error term with a constant variance. 
In fact, the Serfling method followed a two-step 
procedure. The flrst step was to determine a base- 
line describing the expected pattern of the histori- 
cal disease excess mortality. Since the baseline 
model estimated the non-epidemic phase of the 
disease, weeks that typically showed high disease 
incidence were excluded to avoid overestimating 



the parameters. The main problem in Serfling ap- 
proach was to determine the epidemic points in 
this stage. As a criterion, Pelat et al. (2007) pro- 
posed excluding the 20% highest values of data to 
account for past outbreaks in modeling the base- 
Hne (34). Then the estimated baseline was used to 
predict future time series of the expected disease 
rates. In the second step, an epidemic threshold 
was obtained by calculating an upper percentile 
for the prediction distribution according to the 
baseline. Consequentiy, an outbreak was detected 
while an observation exceeds the predefined thre- 
shold (28). Moreover, an automated web-based 
application of Serfling model has been con- 
structed by Pelat et al. for the retrospective and 
prospective surveillance of diseases (34). 

Hidden Markov Models 

A HMM is a statistical tool to flt a mixture distri- 
bution on a sequence of dependent data. The ap- 
plication of the models have been recognized in 
many areas, including automatic speech recogni- 
tion, electrocardiographic signal analysis, epileptic 
seizure frequency analysis, DNA sequence analysis, 
the modeling of neuron flring and meteorology 
(30). A HMM consists of a bivariate discrete time 
process like {S,,Y}^,>-,, where {i'^} is an unobserv- 
able Markov chain and, conditional on {S^, {Y^} 
is a sequence of independent random variables 
such that the conditional distribution of Y, only 
depends on S,. The sequences {J,} and {Y,} are 
often called state sequence and observed sequence, 
respectively (35). The dependence structure of a 
HMM can be represented by a graphical model as 
in Fig. 2. 




Fig. 2 : Graphical representation of the dependence 
structure of two sequences in a hidden Markov 
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Let S\ (t—1,2,...,n) represents a first order Markov 
chain which takes on one of the m values 
l,2,...,m by a transition matrix r—(a,^^^^^ and ini- 
tial probability distribution n—(np.. ., nj^, where 
- P(Sf-j I St_i-i) i,j = 1,2, ...,m; t- 1,2, . . .,n 
71, — P(S^—i) i — 1 ,2,.. .,m 
Moreover, the conditional distribution of given 
S^i follows a parametric form ffypO) where 6- is a 
vector of unknown parameters. Fitting a HMM 
to the data requires estimation of the parameters 
including the initial and transition probabilities 
and distribution parameters. There are mainly two 
approaches to estimate the parameters in the 
HMM literatures. The first can be achieved by a 
maximum likelihood technique using a modified 
EM-algorithm, known as the Baum-Welch me- 
thod. The other is Bayesian framework which as- 
siimes the parameters follow a prior distribution 
and then updates them through a Monte Carlo 
Markov Chain (MCMC) technique. Consequentiy, 
after the estimation of parameters, the most likely 
sequence of hidden states that produced the data 
should be decoded by use of the Viterbi algorithm. 
The following section explains how we used 
HMMs to detect unusual states of sputum smear 
positive pulmonary TB in Iran. 

Fitted model 

Two different occurrences were assumed for the 
disease in every week: usual phase which corres- 
ponds to the expected disease incidence and un- 
usual phase which occurs when the disease inci- 
dence increases significantiy compared to the 
usual phase, we described {Y,} at each time 
t—1,2,...,n to represent the observed number of 
new SS+ infections occurred over the week t, as 
Y, belongs to two distinct distributions associated 
with the disease states which have been mixed to- 
gether. A dichotomous variable {i",} was also de- 
fined to denote the usual and unusual phases of 
disease by taking two values 1 and 2, respectively. 
We accepted {S^} follows a Markov process, on 
condition that the disease phase in each week de- 
pends only on the previous state. In fact, the state 
sequence of disease {S,} is a virtual variable and 
cannot be measured directiy, because there is no 
specified criterion for TB to decide on either a 



usual or unusual state occurs based on the ob- 
served number of cases. Again, we described 
7T—(7T^ ,712^ to be the vector of initial probabilities, 
where tt, and tt^ exhibit the probabilities of arising 
usual and unusual states in the first week, respec- 
tively. And r—(aij)2*2 denotes the transition proba- 
bilities matrix, where ; 1 ,2 is the transition 
probability from state i to state j. Consequentiy, a 
Hidden Markov Model was obtained by assuming 
the conditional variables Y,,Y^, ... , y„, given the 
disease state S^, are independent random variables. 
We aimed to make an inference about the disease 
state at each time according to the observed dis- 
ease incidence. Fig. 3 illustrates a better represen- 
tation for the first chain of the model structure 
and related probabilities applied to the SS+ pul- 
monary TB data. 




Fig. 3 : Graphical representation of the first chain 
of HMM structure and related probabilities 
applied to the SS+ pulmonary TB data 

Since the data were considered as discrete counts, 
a mixture of Poisson-Poisson distribution was as- 
sumed to fit to the data. Thus, with a Poisson dis- 
tribution model we have 

P(X=s\S,= i) = e~'- k./s! i = l,2 

In order to consider the fluctuations of SS+ 
counts time series, we deflned two baseline mod- 
els for evaluating the expected disease counts (A,) 
and then, a Hidden Markov Model was applied 
for each desired model. The first model was ob- 
tained by assuming the expected disease load is 
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constant over the considered time period; that is 
the time series of SS+ counts does not follows a 
seasonal trend. The model parameters in this case, 
are A, and that should be estimated simulta- 
neously along with the related initial and transition 
probabilities. For the second model, we supposed 
the baseline of the number of new infections fol- 
lows a seasonal trend over the considered time 
period. Since the model parameters were consi- 
dered time-dependent in this case, another index 
(/) was added to the model parameters describing 
the time. We assumed then, a cyclic regression as a 
function of the model parameters for both states 
of the disease. Thus, for each state, the model can 
be written as follows: 

r r 

\ =^Y,\S, =2}=(l3,+l3j+l3f+l3,sin^}+licos^} 

r r 

Where j3; i—0,...,3 are the regression coefficients 
and reflects an increase in mean upon transition 
to an unusual phase. After constructing the two 
baselines, maximum likelihood estimation of all 
parameters were computed for both models ac- 
cording to the observed sequence of disease inci- 
dence using a Baum- Welch algorithm. Besides, in 
order to detect the anomaly phases of the disease, 
the most likely sequence of hidden states was de- 
tected by Viterbi algorithm based on the updated 
estimation of parameters. 

In addition, in order to evaluate and compare 
these two models, we measured the models good- 
ness-of-fit using two criteria. The first measure we 
used was adjusted coefficient of determination. It 
can be calculated as follows: 

Adjusted =l-(l-R^ ) 

n-k 

Where denotes the coefficient of determina- 
tion; n is the number of time points; and k stands 
for the number of free parameters in the model. 
The Bayesian Information Criterion (BIC) used as 
the second criterion. If 4^ denotes the maximized 
likelihood and k the number of free parameters, 
so BIC can be formulated as below: 



Then the model with the highest values of both 
adjusted and BIC was chosen as the appropri- 
ate one. Finally, the results of the selected model 
were compared with the Serfling epidemic thre- 
shold and both sensitivity and specificity indices 
were computed to check the model validity. 
AH computations were performed using R version 
2.14.1 (Free GNU license) and the source code is 
also available on request. 

Results 

The time series of SS+ data included 312 weeks 
associated with the six years of consideration from 
April 2005 to March 2011. Fig. 1 illustrates this 
time series from 2005 to 2011. According to the 
results, the observed weekly numbers of patients 
in Iran ranged between 60 and 168 cases with a 
mean and standard deviation of 97 and 20 cases 
per week. Table 1 show the monthly number of 
new SS+ cases and associated incidence rates dur- 
ing six-year period 2005-2011. The population 
estimations of Iran were obtained from the two 
nationwide censuses 1996 and 2006 for each year. 

Table 1: the number of new SS+ cases and associated 
incidence rates for each of the six periods 2005-2011 





Number 






Incidence 




of SS+ 


Population 




rates per 


Period* 


cases 


size 




100,000 


2005-2006 


4,561 


69,390,405 


6.57 


2006-2007 


4,811 


70,495,782 


6.82 


2007-2008 


4,677 


71,532,062 


6.54 


2008-2009 


4,880 


72,583,586 


6.72 


2009-2010 


5,086 


73,650,566 


6.91 


2010-2011 


5,171 


74,733,230 


6.92 



* Each period starts from the beginning of April to the end 
of the next March. 



Two HMMs were applied to the data by assuming 
both constant and seasonal trend as the expected case 
load of the disease. The updated estimation of para- 
meters were obtained using Baum- Welch algorithm 
after 50 and 1 52 iterations, respectively. The results of 
both models have been summarized in Table 2. 
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Table 2: Parameters estimation of two models with and without seasonality for both phases of the 

disease 



Initial 



Transition proba- 
bility 



Model parameters 



Model 


phase 


of states 


probably 


Usual 


Unusual 




l3i 


J32 


/33 1 


With seaso- 


Usual 


192 


1 


0.70 


0.30 


76.82 


0.07 


7.15 


7.42 


nality 


Unusual 


120 


0 


0.45 


0.55 


100.26 
A 


0.07 


7.15 


7.42 


Without 


Usual 


160 


1 


0.79 


0.21 


82.45 








seasonality 


Unusual 


152 


0 


0.21 


0.79 


111.24 









Moreover, the shifts in the mean number of pa- 
tients while moving to the abnormal phase were 
observed 23.44 and 28.79 in two models with and 
without seasonality assumption, respectively. Fur- 
thermore, in order to reveal the unusual phases of 
SS+ pulmonary TB, the most likely state sequence 
of models was detected by using the updated para- 
meters in Viterbi algorithm. Fig. 4 and 5 present 



the disease states, detected according to constant 
and seasonal baselines, respectively. 
Consequentiy, both adjusted R-squared and BIC 
were measured for the evaluation of the models 
goodness-of-fit. As Table 3 exhibits, the higher 
values of both criteria for the model with seasonal 
trend show a better goodness-of-fit than the mod- 
el without seasonality. 



Table 3: Model evaluation and goodness-of-fit for both models with and without seasonal trends as base- 

Une 



Model 


Number of 


R- 


Adjusted R- 


Log-likelih- 


BIC 1 


iterations 


squared 


squared 


ood 


With seasonality 


577 


0.72 


0.72 


-1316.56 


-1336.66 


Without seasonality 


50 


0.56 


0.55 


-1375.26 


-1386.75 



Therefore, the seasonal model was selected to 
compare with the Serfling epidemic threshold in 
the detection of disease anomaly states. Also, the 
upper 95th percentile of the prediction distribu- 
tion from the seasonal baseline in the HMM was 
calculated as the Serfiing threshold. Considering 
the threshold results as reference. Sensitivity and 
specificity were obtained 0.94 and 0.87, respec- 
tively. More information has been detailed in Ta- 
ble 4. 

In order to achieve a better illustration of variabil- 
ity and fiuctuations of SS+ pulmonary TB inci- 
dence during a year in Iran, we computed the av- 
erage for the number of weeks being in unusual 
phase for each month over 2005-2011. 



Table 4: Model validity indices for both baselines 
based on the Serfiing threshold as reference 



Model 




Validity Index 
Sensitivity Specificity 



With seasonaUty 0.94 0.87 
Without seasonality 0.68 0.73 



Fig. 6 was drawn to depict the changes of the pro- 
portion of weeks with unusual phases over 
months. For both models, the highest average for 
the number of unusual phases of the disease was 
obtained in March. 
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ID 



number of cases : 
|tate(normal/unusufal) 




2005 



2006 



2007 



2008 

time (weeks) 



2009 



2010 



2011 



Fig. 4: Unusual states of the weekly SS+ data in Iran over 2005-2011, obtained by applying by Viterbi al- 
gorithm in HMM without seasonality 
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Fig. 5: Unusual states of the weekly SS+ data in Iran over 2005-2011, obtained by applying by Viterbi al- 
gorithm in HMM with seasonality 
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Fig. 6: Comparison of the proportion of unusual phases attributed to SS+ pulmonary TB among months 
for both baselines, Iran, 2005-2011 
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Discussion 

In this paper, a powerful statistical tool was em- 
ployed to monitor the routinely TB surveillance 
data retrospectively, in an attempt at detecting ab- 
normalities and aberrations from the expected 
disease incidence. Although HMMs have been 
already applied to the flu-Uke disease, poliomyelitis 
(30), Malaria, Leprosy (36), nosocomial infections 
(37) and Hepatitis A (31) data, we found no litera- 
ture using the model in TB surveillance for the 
same purpose. This study demonstrated that a 
two-state HMM provides a conceptually simple 
framework to segment the sequence of disease 
counts of SS+ patients into expected and abnor- 
mal phases. 

In order to consider the natural variations of TB 
in Iran, Two models with and without seasonal 
trends were applied and parameters were trained 
using the data. The comparison of two models 
showed a better goodness-of-fit for the model 
which characterizes baseUne with a seasonal trend, 
having an adjusted R-squared of 0.72 and a BIC of 
-1336.66 (Table 3). As Fig. 5 shows, the SS+ 
counts data followed a seasonal trend during this 
six-year period 2005-2011. The trend characte- 
rized the disease time series with a considerable 
increase in the early of spring. A study from the 
South Africa also showed seasonal variation of TB 
rates with a higher incidence in the late winter to 
early of spring (38). The findings also presented a 
linear increase in the number of SS+ patients with 
approximately 7 cases per 1 00 weeks over the time 
period of the current study. 

According to Fig. 6, we found different patterns 
for two models in detection of unusual phases of 
disease. Under the constant assumption for base- 
line, the model demonstrates an upward trend in 
the number of detected cases with unusual phases 
from January and reaches its peak in March and 
then, decreases gradually by the end of the year. In 
the model with seasonality assumption, the aver- 
age for the number of detected weeks with un- 
usual phases exhibits a stable situation over 
months, except two considerable peaks in March 
and November. On average, the number of un- 



usual states detected by this model is lower than 
one without seasonal trends, especially during 
January to August. Nevertheless, both models we 
applied in this study show the same peak in March 
in the average of the weeks with unusual states. 
Therefore, we may conclude again that a higher 
incidence in SS+ pulmonary TB is occurred in the 
late winter in Iran. 

Since the data we used for the study presented the 
weekly number of infections, so a mixture of Pois- 
son probabiUtjr distribution was assumed to be 
followed by data. However, the large amounts of 
the weekly disease counts have caused a lack in 
the Model goodness-of-fit and made the UkeMh- 
ood to show a very small value. If we could Hmit 
our study population to a smaller region like a 
province or prisoner population, which are at 
higher risk of TB, we would expect to obtain a 
better goodness-of-fit. Another way was to use 
the disease incidence rates rather than disease 
counts. In this case, there were more available dis- 
tributions Uke Gaussian, exponential or a mixture 
of them to fit on the data. Another attractive al- 
ternative was to add other variables such as auto- 
regressive terms or some covariates closely asso- 
ciated with TB Uke socioeconomic status or cli- 
mate changes as a function of the model parame- 
ters to present better fitness for the disease base- 
line. 

In this study, we used a modified EM-algorithm 
known as Baum-Welch in some HMM literatures 
to estimate the model parameters based on a max- 
imum likelihood technique. As a limitation of the 
study, we did not compute the standard errors of 
the estimations in order to make inference or re- 
port their confidence intervals. Because it needs 
some other extensions like using bootstrap me- 
thod and is not addressed in the present study (37). 
Another alternative is using Bayesian approach 
which provides simpler framework for the estima- 
tion of parameters and making inference about 
them by a MCMC sampling algorithm (31, 36, 39, 
40). To this end, first a prior distribution should 
be introduced to the parameters and then updated 
by a MCMC algorithm given the observations. 
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